class: center, middle, inverse, title-slide

.title[
# Normal Distributions and Rescaling
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---
class: middle

# Normal Distribution

---

## Normal Distribution

Definition: a bell-shaped curve described by the following density function

`\(f(x)= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}}\)`

- The formula has two parameters
  - the mean `\(\mu\)`
  - the standard deviation `\(\sigma\)`
- The standard normal `\((\mu=0; \sigma=1)\)` simplifies the equation

---

# Normal Curves with Different Means

- The mean is located at the center of the symmetric curve and is the same as the median.
- Changing `\(\mu\)` without changing `\(\sigma\)` moves the Normal curve along the horizontal axis without changing its variability.

.small[
<img src="data:image/png;base64,#rnorms_files/figure-html/norm-1.png" width="55%" style="display: block; margin: auto;" />
]

---

.small[
```r
#### Normal Distribution
# Display normal distributions with various means
x <- seq(-80, 80, length.out = 1000)
hx <- dnorm(x)  # standard normal density, drawn as the dashed reference curve
colors <- c("red", "blue", "green", "green", "gold", "black")

plot(x, hx, type = "l", lty = 2, xlab = "x value", ylab = "Density",
     main = "Comparison of Normal Distributions", xlim = c(-5, 7))

# Overlay one curve per mean; the standard deviation stays at 1
location <- c(2, 4, -2)
for (i in 1:3) {
  lines(x, dnorm(x, mean = location[i]), lwd = 1, col = colors[i])
}
```

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-2-1.png" width="40%" style="display: block; margin: auto;" />
]

---

# Normal Curves with Different Standard Deviations

.pull-left[
- The standard deviation `\(\sigma\)` controls the variability of a Normal curve.
- When the standard deviation is larger, the area under the Normal curve is less concentrated about the mean.
- The standard deviation is the distance from the center to the change-of-curvature points on either side.
]

.pull-right.small[
<img src="data:image/png;base64,#rnorms_files/figure-html/zsd-1.png" width="90%" style="display: block; margin: auto;" />
]

---

.small[
```r
# Display normal distributions with various standard deviations
# (x and hx are defined in the previous chunk)
plot(x, hx, type = "l", lty = 2, xlab = "x value", ylab = "Density",
     main = "Comparison of Normal Distributions", xlim = c(-10, 10))

sds <- c(.5, 2, 4, 6)
sd_colors <- c("red", "blue", "green", "gold")
# Index the colors by position; indexing by the sd value itself picks the wrong entries
for (i in seq_along(sds)) {
  lines(x, dnorm(x, sd = sds[i]), lwd = 1, col = sd_colors[i])
}
```

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-3-1.png" width="40%" style="display: block; margin: auto;" />
]

---

# Normal Distribution

.pull-left[
- In the Normal distribution with mean `\(\mu\)` and standard deviation `\(\sigma\)`:
  - approximately `\(68\%\)` of the observations fall within 1 `\(\sigma\)` of `\(\mu\)`
  - approximately `\(95\%\)` of the observations fall within 2 `\(\sigma\)` of `\(\mu\)`
  - approximately `\(99.7\%\)` of the observations fall within 3 `\(\sigma\)` of `\(\mu\)`
- This property is sometimes called the `68-95-99.7 Rule`.
]

.pull-right[
<img src="data:image/png;base64,#../img/normal.png" width="95%" style="display: block; margin: auto;" />
]

---

<img src="data:image/png;base64,#../img/normal.png" width="70%" style="display: block; margin: auto;" />

---

# Worked Example

.pull-left[
- The distribution of Iowa Test of Basic Skills (ITBS) vocabulary scores for seventh-grade students in Gary, Indiana, is close to Normal.
- Suppose the distribution is `N(6.84, 1.55)`.
]

--

.pull-right[
.question[
- Q. What is the mean of the distribution?
- Q. What is the standard deviation of the distribution?
]
]

--

<img src="data:image/png;base64,#../img/norms.png" width="55%" style="display: block; margin: auto;" />

---

# Worked Example

.pull-left[
- The distribution of Iowa Test of Basic Skills (ITBS) vocabulary scores for seventh-grade students in Gary, Indiana, is close to Normal.
- Suppose the distribution is `N(6.84, 1.55)`.
]

--

.pull-right[
- Sketch the Normal density curve for this distribution.
- Q.
What percent of ITBS scores is between 3.74 and 9.94?
- Q. What percent of the scores is below 3.74?
]

--

<img src="data:image/png;base64,#../img/norms.png" width="55%" style="display: block; margin: auto;" />

---

.question[Check your understanding: What percent of the scores is above 5.29?]

<br>

<img src="data:image/png;base64,#../img/norms.png" width="100%" style="display: block; margin: auto;" />

---

# Standard Normal

- The Normal distribution is a model of the real world
  - Not exact, but a convenient model for many things
    - Physical features
    - Psychological features
    - Performance measures

--

- Not all variables are Normal
  - Skewed variables (e.g., income)
  - Count variables (number of kids, mistakes on an exam)

---

# Real World Data

.pull-left[
- Many variables follow this distribution (but not all).
- I have plotted histograms of data we have already used in this class, overlaid with a Normal curve.
]

--

.small.pull-right[
<img src="data:image/png;base64,#rnorms_files/figure-html/example-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Height of Children (Galton dataset)

.small[
```r
library(HistData)   # provides the Galton data
library(tidyverse)  # loads ggplot2 and the %>% pipe

Galton %>%
  ggplot(aes(x = child)) +
  # binwidth = 1 keeps the bar counts on the same scale as n * density
  geom_histogram(fill = "red", binwidth = 1) +
  stat_function(
    fun = function(x, mean, sd, n) {
      n * dnorm(x = x, mean = mean, sd = sd)
    },
    args = with(Galton, list(mean = mean(child), sd = sd(child), n = length(child)))
  ) +
  scale_x_continuous("Heights of Children") +
  theme_minimal()
```

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-9-1.png" width="40%" style="display: block; margin: auto;" />
]

---

# IMDb movie ratings (movies dataset)

.pull-left.midi[
```r
library(ggplot2movies)
data(movies)

ggmovie <- ggplot(movies, aes(x = rating)) +
  geom_histogram(fill = "blue") +
  # Frequency polygon of simulated Normal values with the same mean and SD
  geom_freqpoly(aes(x = rnorm(length(rating)) * sd(rating) + mean(rating)),
                colour = "black") +
  scale_x_continuous("IMDb Movie Ratings") +
  theme_minimal()
```
]

.pull-right[
<img
src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Temperature in Nottingham (nottem dataset)

<img src="data:image/png;base64,#rnorms_files/figure-html/nottem-1.png" width="70%" style="display: block; margin: auto;" />

---

# Standard Normal

- Normal distribution tricks
  - Symmetric
  - 50% of the area lies above zero
  - Total area under the curve is 1.0 (or 100%)

---

# Area under the Normal Distribution

<img src="data:image/png;base64,#../img/normal.png" width="60%" style="display: block; margin: auto;" />

---
class: middle

# Wrapping Up...

---
class: middle

# Rescaling

---

# Rescaling

- All Normal distributions are the same if we measure in units of size `\(\sigma\)` from the mean `\(\mu\)` as center.
- We can convert any variable into the same metric as the standard normal.
- Changing to these units is called standardizing or rescaling.

---

# Z-Score

- A z-score describes the location of a raw score in terms of its distance from the mean, measured in standard deviations.

.pull-left[
- Sample statistic
  - `\(z_{i}\)` = `\(\frac{x_{i}-\bar{x}}{s}\)`
]

--

.pull-right[
- Population
  - `\(z_{i}\)` = `\(\frac{x_{i}-\mu}{\sigma}\)`
]

---

# Merits

.pull-left[
- Advantages
  - Allows us to compare scores on a common metric
  - Origin is 0.
The mean
  - The units are 1, the standard deviation
  - '+' values are above the mean
  - '-' values are below the mean
]

--

.pull-right[
- We can compare across measurement scales
- The shape of the distribution does NOT change
- We can go from z-scores back to raw scores
]

---

# Demo

.pull-left[
.small[
```r
library(ggplot2movies)
# Raw data
movies$rating[1:10]
```

```
## [1] 6.4 6.0 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6.0
```

```r
# Rescaling: scale() subtracts the mean and divides by the standard deviation
variable <- movies$rating
scale(variable)[1:10]
```

```
## [1]  0.30079877  0.04323788  1.45982279  1.45982279 -1.63090793
## [6] -1.05139592 -0.40749368  0.49396944  0.42957922  0.04323788
```
]

- Mean = 5.93 (SD = 1.55)
]

--

.pull-right[
.midi[

| Raw| Z_Score|
|---:|-------:|
| 6.4|    0.30|
| 6.0|    0.04|
| 8.2|    1.46|
| 8.2|    1.46|
| 3.4|   -1.63|
| 4.3|   -1.05|
| 5.3|   -0.41|
| 6.7|    0.49|
| 6.6|    0.43|
| 6.0|    0.04|

]
]

---

# Demo

.pull-left[
```r
plot(density(variable)) # no scaling
```

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-14-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right[
```r
plot(density(scale(variable))) # with scaling
```

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-15-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Z-Score

- A z-score gives us information about the location of a score relative to the "average" deviation of all scores
- A z-score is the number of standard deviations a score is above or below the mean of the scores in a distribution
- A raw score is a regular score before it has been converted into a z-score
- Raw scores on very different variables can be converted into z-scores and directly compared

---

# Worked Z-Score Problem

- Here are the IQ test scores of 31 seventh-grade girls in a Midwest school district.

<br>

<img src="data:image/png;base64,#../img/iqz.png" width="70%" style="display: block; margin: auto;" />

---

# Worked Z-Score Problem

A) We expect IQ scores to be approximately Normal.
--

- Make a stem plot to check that there are no major departures from normality.

--

<img src="data:image/png;base64,#../img/iqstem.png" width="65%" style="display: block; margin: auto;" />

---

# Worked Z-Score Problem

B) Find the mean and standard deviation

--

- Mean: `\(\bar{x} = \frac{\sum x_{i}}{n} = \frac{3281}{31} = 105.84\)`
- SD: `\(s = \sqrt{\frac{\sum^{n}_{i=1}(x_{i}-\bar{x})^{2}}{n-1}} = \sqrt{\frac{\sum^{n}_{i=1}(x_{i}-105.84)^{2}}{30}} = 14.27\)`

---

# Worked Z-Score Problem

C) What proportion of scores are within one standard deviation of the mean?

- One SD above mean = 105.84 + 14.27 = 120.11
- One SD below mean = 105.84 - 14.27 = 91.57
- 23/31 = 0.74

--

<img src="data:image/png;base64,#../img/workz.png" width="65%" style="display: block; margin: auto;" />

---

# Worked Z-Score Problem

D) What proportion of scores are within two standard deviations of the mean?

- Two SD above mean = 105.84 + 2*(14.27) = 134.38
- Two SD below mean = 105.84 - 2*(14.27) = 77.30
- 29/31 = 0.935

---

# Worked Z-Score Problem

E) What would these proportions be in an exactly Normal distribution?

- +/- One SD?

<img src="data:image/png;base64,#../img/table.png" width="90%" style="display: block; margin: auto;" />

---

# Moving to PowerPoint

---

## Cumulative Proportions

- The cumulative proportion for a value `\(x\)` in a distribution is the proportion of observations in the distribution that are less than or equal to `\(x\)`.
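
As a minimal sketch of this definition in R, the cumulative proportion at a value can be computed under a Normal model with `pnorm()`, and in observed data as the share of values at or below it. The model N(6.84, 1.55) is the ITBS distribution from the earlier worked example; the simulated `scores` vector, the seed, and the variable names are assumptions made only for illustration.

```r
# Cumulative proportion at 5.29 under the Normal model N(6.84, 1.55)
x_value <- 5.29
model_prop <- pnorm(x_value, mean = 6.84, sd = 1.55)
round(model_prop, 4)  # 0.1587, because 5.29 is one SD below the mean

# Cumulative proportion in a sample: share of observations <= x_value
# (the sample is simulated here, which is an assumption for illustration)
set.seed(1)
scores <- rnorm(1000, mean = 6.84, sd = 1.55)
sample_prop <- mean(scores <= x_value)
sample_prop
```

The sample proportion should land close to the model value, with the gap shrinking as the sample grows.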
<img src="data:image/png;base64,#../img/table.png" width="90%" style="display: block; margin: auto;" />

---

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-21-1.png" width="90%" style="display: block; margin: auto;" />

## Worked Z-Score Problem

### Example 1: Within One Standard Deviation

The area between -1 and +1 standard deviations from the mean:

```r
# Table A: the cumulative proportion is 0.8413 at z = 1.00 and 0.5000 at z = 0
area_mean_to_one_sd <- 0.8413 - 0.500
print(paste("Area between the mean and +1 SD:", area_mean_to_one_sd))
```

```
## [1] "Area between the mean and +1 SD: 0.3413"
```

```r
# By symmetry, doubling covers the interval from -1 SD to +1 SD
total_area <- area_mean_to_one_sd * 2
print(paste("Total area within ±1 SD:", total_area))
```

```
## [1] "Total area within ±1 SD: 0.6826"
```

<img src="data:image/png;base64,#rnorms_files/figure-html/unnamed-chunk-23-1.png" width="90%" style="display: block; margin: auto;" />

---

### Example 2: Within Two Standard Deviations

The area between -2 and +2 standard deviations from the mean in a Normal distribution:

```r
# Table A: the cumulative proportion at z = 2.00 is 0.9772
area_mean_to_two_sd <- 0.9772 - 0.500
print(paste("Area between the mean and +2 SD:", area_mean_to_two_sd))
```

```
## [1] "Area between the mean and +2 SD: 0.4772"
```

```r
# By symmetry, doubling covers the interval from -2 SD to +2 SD
total_area_two_sd <- area_mean_to_two_sd * 2
print(paste("Total area within ±2 SD:", total_area_two_sd))
```

```
## [1] "Total area within ±2 SD: 0.9544"
```

---

## Starting from a Proportion

SAT reading scores for a recent year are distributed according to an N(504, 111) distribution. How high must a student score to be in the top 10% of the distribution?
---

### Normal Calculations

```r
# Create a data frame to represent the table from the image.
# Rows give z to one decimal place; columns give the second decimal place.
# Note that data.frame() prefixes the numeric column names with "X".
z_table <- data.frame(
  z = c(1.1, 1.2, 1.3),
  `0.07` = c(0.8790, 0.8980, 0.9147),
  `0.08` = c(0.8810, 0.8997, 0.9162),
  `0.09` = c(0.8830, 0.9015, 0.9177)
)
print(z_table)
```

```
##     z  X0.07  X0.08  X0.09
## 1 1.1 0.8790 0.8810 0.8830
## 2 1.2 0.8980 0.8997 0.9015
## 3 1.3 0.9147 0.9162 0.9177
```

```r
# Find the z-score for the top 10% (cumulative proportion 0.90)
z_score <- qnorm(0.90)
print(paste("z-score for top 10%:", round(z_score, 2)))
```

```
## [1] "z-score for top 10%: 1.28"
```

```r
# Convert the z-score back to the SAT scale
mean_score <- 504
sd_score <- 111
required_score <- mean_score + z_score * sd_score
print(paste("Required score for top 10%:", round(required_score, 2)))
```

```
## [1] "Required score for top 10%: 646.25"
```

A student would have to score at least about 646 to be in the top 10% of the distribution of SAT reading scores for this particular year. The exact z-score from `qnorm()` gives 646.25; the rounded table value z = 1.28 gives 504 + 1.28(111) = 646.08.

## "Backward" Normal Calculations

Steps for using Table A given a Normal proportion:

1. State the problem in terms of the given proportion. Draw a picture that shows the Normal value, `\(x\)`, that you want in relation to the cumulative proportion.
2. Use the table, the fact that the total area under the curve is 1, and the given area under the standard Normal curve to find the corresponding `\(z\)`-value.
3. Unstandardize `\(z\)` to solve the problem in terms of a non-standard Normal variable `\(x\)`.
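
---

The three steps above can be sketched as a small R function. This is only an illustration under stated assumptions: `score_for_top()` is a hypothetical helper that does not appear in the original slides, and `qnorm()` stands in for the Table A lookup in step 2.

```r
# Hypothetical helper (an assumption, not part of the course materials):
# the score needed to be in the top `top_prop` of a N(mean, sd) distribution.
score_for_top <- function(top_prop, mean, sd) {
  # Step 1: restate the upper-tail proportion as a cumulative proportion
  cum_prop <- 1 - top_prop
  # Step 2: find the z-value with that cumulative proportion
  # (qnorm() replaces the table lookup)
  z <- qnorm(cum_prop)
  # Step 3: unstandardize z back to the raw-score scale
  mean + z * sd
}

sat_cutoff <- score_for_top(0.10, mean = 504, sd = 111)
round(sat_cutoff, 2)  # 646.25, the same value as in the SAT example

# Check: the area above the cutoff should be the requested 10%
pnorm(sat_cutoff, mean = 504, sd = 111, lower.tail = FALSE)
```

Keeping the three steps as separate lines makes it easy to compare each one with the corresponding step when working by hand with the table.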
---

## Functions in R for Normal Distribution

- `dnorm()`: gives the density
- `pnorm()`: gives the cumulative distribution function
  - Computes the probability that a normally distributed random number will be less than or equal to a given value
- `qnorm()`: gives the quantile function
  - Is the inverse of `pnorm()`: give it a probability, and it returns the value whose cumulative proportion matches that probability
- `rnorm()`: generates random deviates

---

## Illustrations with R

Let's demonstrate these functions:

```r
# Density at the mean of a standard normal distribution
print(paste("Density at mean of standard normal:", dnorm(0)))
```

```
## [1] "Density at mean of standard normal: 0.398942280401433"
```

```r
# Cumulative probability at 1 SD above the mean
print(paste("Cumulative probability at 1 SD above mean:", pnorm(1)))
```

```
## [1] "Cumulative probability at 1 SD above mean: 0.841344746068543"
```

```r
# Value at the 90th percentile of a standard normal
print(paste("90th percentile of standard normal:", qnorm(0.9)))
```

```
## [1] "90th percentile of standard normal: 1.2815515655446"
```

```r
# Generate 5 random numbers from N(0,1)
# (the values differ from run to run unless a seed is set)
print("5 random numbers from standard normal:")
```

```
## [1] "5 random numbers from standard normal:"
```

```r
print(rnorm(5))
```

```
## [1]  0.67126524 -2.04705517 -0.44531433 -0.09432412 -0.76690284
```
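
---

As a further hedged sketch, the same functions can be used to check the 68-95-99.7 rule and to show that `qnorm()` undoes `pnorm()`. The helper `within_k_sd()`, the variable names, and the seed are assumptions introduced only for this illustration.

```r
# Area within k standard deviations of the mean for a standard Normal
within_k_sd <- function(k) pnorm(k) - pnorm(-k)
rule <- sapply(1:3, within_k_sd)
round(rule, 4)  # 0.6827 0.9545 0.9973

# qnorm() is the inverse of pnorm(): the round trip returns the starting z-score
z <- 1.5
all.equal(qnorm(pnorm(z)), z)  # TRUE

# Simulated draws from rnorm() should roughly reproduce the rule
# (the seed is an assumption, set only so the example is repeatable)
set.seed(42)
draws <- rnorm(10000)
share_within_one_sd <- mean(abs(draws) <= 1)
share_within_one_sd
```

The exact areas are slightly different from the rounded 68, 95, and 99.7 percent figures, which is why the rule is stated as approximate.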